arXiv:1507.08377vl [stat.ME] 30Jul2015 


Large Covariance Estimation through Elliptical Factor 

Models 


Jianqing Fan* Han Liu* and Weichen Wang* 

Department of Operations Research and Financial Engineering, Princeton University 


Abstract 

We proposed a general Principal Orthogonal complEment Thresholding (POET) 
framework for large-scale covariance matrix estimation based on an approximate factor 
model. A set of high level sufficient conditions for the procedure to achieve optimal 
rates of convergence under different matrix norms were brought up to better understand 
how POET works. Such a framework allows us to recover the results for sub-Gaussian 
in a more transparent way that only depends on the concentration properties of the 
sample covariance matrix. As a new theoretical contribution, for the first time, such a 
framework allows us to exploit conditional sparsity covariance structure for the heavy¬ 
tailed data. In particular, for the elliptical data, we proposed a robust estimator based 
on marginal and multivariate Kendall’s tau to satisfy these conditions. In addition, 
conditional graphical model was also studied under the same framework. The technical 
tools developed in this paper are of general interest to high dimensional principal 
component analysis. Thorough numerical results were also provided to back up the 
developed theory. 
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1 Introduction 


This paper considers large factor model based covariance matrix estimation for heavy¬ 
tailed data. Factor model is a powerful tool for dimension reduction and latent factor ex¬ 
traction, which gained its popularity in various applications from finance to biology. When 
applied to covariance matrix estimation, it assumes a conditional sparse covariance struc¬ 
ture, i.e., conditioning on the low dimensional spiked factors, the covariance matrix of the 


idiosyncratic errors is sparse. To be specific, consider the approximate factor model in Bai 
and Ng! (2002): 

Uit = b-fi + u it , (1.1) 

where y it is the observed data for the Ah (i — 1,... ,p) dimension at time t — 1,..., n; f t is 
an unknown m-dimensional vector of common factors, and bj is the factor loading for the Ah 
variable; uu is the idiosyncratic error, uncorrelated with the common factors. Previous works 
are limited by only considering Gaussian or sub-Gaussian factors and noises. In this paper 
we aim to extend this limitation and consider heavy-tailed distributions. More specifically, 
we will consider the case where factors and noises are elliptically distributed. Under this 
broader class of heavy tailed distributions, we aim to understand how to estimate covariance 
matrix accurately. 

Covariance matrix estimation has been pioneered by Bickel and Levina (2008afb) and 


Fan et al. (2008). After that, substantial amount of work has focused on the inference of 


high-dimensional covariance matrices under unconditional sparsity ( 

Cai and Liu, 2011 

Cai 

et al. 2013c, 2010; Karoui, 2008 

Lam and Fan, 2009; Ravikumar et al. 2011) or conditional 

sparsity (Amini and Wainwright 

2008 Berthet and Rigollet, 2013b a 

Birnbaum et al., 

2013 


Cai et ah, 

2013b a Johnstone and Lu 

2009 

Levina and Vershynin 2012 

Rothman et al. 

2009 

Ma, 

2013 

Slicn et ah, 

2013; 

Paul and Johnstone 

,2012 Vu and Lei, 

2012 

Zou et al. 


2006). This research area is very active, and as a result, this list of references is illustrative 


rather than comprehensive. To emphasize, Fan and his collaborators proposed to use factor 


model or conditional sparsity structure for covariance matrix estimation (Fan et al. 2008 


2011, 2013, 2014c). The model encompasses the situation of unconditional sparse covariance 


by setting the number of factors to zero. Thus it is more general and realistic given the fact 
that the observed data are usually driven by some common factors. 

Another line of research on robust covariance estimation also receives significant attention 


from the literature. The idea of robust estimation dates back to Huber (1964) and had been 


extended in regression problems with different types of loss function; see for example Fan 


et al. (2014b) and Catoni (2012). Recently, Han and Liu (2013b, 2014) introduce robust 


covariance matrix estimation to high-dimensional elliptical and transclliptical (or elliptical 
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copula) distribution family. In those papers, they proposed a robust procedure using the 
marginal Kendall’s tau statistics and proved its optimality for covariance matrix estimation 
under elliptical distributions. In addition, multivariate Kendall’s tau was also considered 


by Han and Liu (2013a) to estimate eigenspaces of covariance matrices in high dimensions. 


Those methods, applied to PCA or sparse PCA, can be potentially useful for dealing with 
factor models with heavy-tailed factors and noises. The goal of the current paper is to develop 
a unified theory that allows us to extend these robust rank-based covariance estimation 
procedures to handle heavy-tailed data with conditional covariance sparsity. 


1.1 Background on approximate factor model 

To illustrate how to use factor model as a dimension reduction tool for covariance matrix 


estimation, let us write model (1.1) in its vector form: 

y t = Bf) + u t , 


( 1 . 2 ) 


where y t contains all observed individuals at time t = 1,..., n and B = (b 1; ..., b p )' is the 


factor loading matrix. The matrix form of (1.1) is 


Y = BF' + U , 


(1.3) 


where Y pxn , B pxm , F nxm , U pxn are matrices from observed data, factor loadings, factors, 
and errors with Y = (yi,..., y n ), F = (fi,... , i n )' and U = (ui,..., u n ). Here we consider 
the case where the dimension p is larger than sample size n and for simplicity we assume 
n samples are independent and identically distributed in the sequel (An extension to the 
dependent setting is straightforward, but tedious.). We assume factor matrix F is observable. 


To make the model (1.1) identifiable, we impose the following conditions as in Bai and Ng 


(2013 

) and 

Bai and Li 

(2012) 


cov(f t ) = I and B'B is diagonal. 


(1.4) 


The conditions in (1.4) are common in the factor model literature. But we will point out 


in Section [2] that these conditions are sufficient only for asymptotic identifiability up to an 


error of order 0( 1/^/p) rather than exact identifiability. Under the conditions in (1.4), the 
covariance matrix of y t is 

E = cov(y f ) = BB' + T, u , (1.5) 


where E n is the covariance matrix of the idiosyncratic error u*. 
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1.2 Major contributions of this paper 


Under model (1.2), Fan et al. (2013) proposed the Principal Orthogonal complEment 
Thresholding (POET) estimator for E under the assumption that factors and noises are 
sub-Gaussian. By imposing the condition that the leading eigenvalues of E diverges at the 


rate of order p from their pervasiveness condition, Fan et ah (2013) proved the consistency 
of the POET estimator and showed its rates of convergence. However, their proofs are 
mathematically involved and do not transparently explain why POET works in estimating 


large covariance matrices. It has been pointed out by Fan and Wang (2015) how pervasive 


factors help in estimating the low-rank part BB ; in (1.5). The idea is further explored in 
this paper. A surprising result is that the diverging signal of spiked eigenvalues excludes the 
necessity of the sparse principal component assumption in sparse PGA literature, comparing 


with for example Cai et al. (2013b). 

The main contributions of the paper are two folds. On one hand, we summarize a uni¬ 


fied generic framework in Section 252 for applying POET to various potentially heavy-tailed 
distributions. The key Theorem 2.1 provides a set of high level interface conditions (1.6) 
explaining how to design a POET covariance estimator according to factor and error distribu¬ 
tions. POET regularization needs the following three components: initial pilot estimators for 
covariance matrix E, its leading eigenvalues A = diag(Ai,..., A m ) and their corresponding 
leading eigenvectors r pxm = (£i, • • •, £ m ). With these compoents, a generic POET estima¬ 
tor can be constructed. We will show that such a POET procedure attains desired rates of 
convergence as long as 

||S - S11 max = Op(\/logp/n) , 

||(A - A)A" 1 || max = 0 P (y/logp/n) , (1.6) 


IP — PI 


= 0 P (A/log p/(np)). 


These conditions are relatively easy to verify, as they involve only the componentwise maxi- 
mums. Through those sufficient conditions, we are able to separate the deterministic analysis 
of the estimation procedure and the probabilistic guarantee of the design of initial estimators. 

For two specific factor and error distributions, we provide methods to construct those 
initial estimators. For sub-Gaussian, it is natural to employ sample covariance matrix and 
its eigenvalues and eigenvectors as the estimates for E, A and T. We show the natural 
idea indeed achieves the above conditions for sub-Gaussian data, which gives an explanation 
why POET in previous literature works. However, for elliptical distributions, constructing 
estimators with the desired rates are highly nontrivial. We use the marginal Kendall’s tau to 
obtain E and A while a different method multivariate Kendall’s tau is applied to construct 
r. Notice an interesting fact that the generic POET procedure allows separately estimating 
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the eigenvectors and eigenvalues using different methods. Robust estimators are constructed 
for the first time for elliptical factor models. 

1.3 Notations 

Here are some useful notations. If M is a general matrix, we denote its matrix entry-wise 
maximum value as ||M|| max = maxjj \M itJ and define the quantities ||M || 2 = Amfx(M'M) (or 
||M|| for short), ||M|| F = {Ylij , ll M lloo = maxj^. \M id \ and ||M||i,i = 

to be its spectral, Frobenius, induced and element-wise t\ norms. If furthermore M is 
symmetric, we define Aj(M) to be the jth largest eigenvalue of M and A max (M), A min (M) 
to be the maximal and minimal eigenvalues respectively. We denote tr(M) to be the trace 
of M. For any vector v, its i 2 norm is represented by ||v|| while i\ norm is written as || v|| !. 
We denote diag(v) to be the diagonal matrix with the same diagonal entries as v. For two 
random matrices A, B of the same size, we say A = B + Op[8 ) if ||A — B|| = Op(S) and 
A = B + o P (6 ) if || A - B || = op(6). Similarly for two random vectors a, b of the same 
length, a = b + Op(5) if ||a — b|| = Op(5) and a = b + o P (5) if ||a — b|| = op(S). We denote 
a = b if random vectors a and b have the same distribution. I 11 the sequel, C is a generic 
constant that may differ from line to line. 

1.4 Paper organization 

In Section[2j we present a generic POET estimating procedure and a high-level theoretical 
interface which secures the consistency of the generic procedure for factor-based conditional 
sparsity mdoels. We verify that the conditions in Section [3] hold with high probability for sub- 
Gaussian data, which provides a transparent understanding of the mechanism of the POET 
methodology. In Section [4| we propose a new method using a combination of marginal and 
multivariate Kendall’s tau and prove its theoretical properties under elliptical factor models. 
Thorough numerical simulations are conducted illustrate the merits of our proposed method 
in Section [5| In Section [6j we conclude the paper with a short discussion. The technical 
proofs are relegated to the appendix. 

2 A High-level theoretical interface 

In this section, we summarize a generic POET procedure and provide a set of high level 
sufficient conditions for consistent covariance estimation when p n. Before doing that, 
let us review what has been achieved in the existing literature where both the factors and 
noises are assumed to be sub-Gaussian. 
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2.1 Spiked covariance model 


Assume the observed random variables {yi}™ =1 have zero mean and covariance matrix 
Sp Xp where the eigenvalues Ai, A 2 ,..., X p of £ are ordered in descending order. We con¬ 
sider the spiked population model as suggested by the approximate factor structure (1.5). 
Specifically we have the following assumption on the eigvenvalues. 


Assumption 2.1 (Spiked covariance model). Let m < min{n,p} be a fixed constant that 
does not change with n and p. As n —)• 00 , Ai > A 2 > • • • > A m ^ A m+1 > • • • > X p > 0, 
where the spiked eigenvalues are linearly proportional to dimension p while the non-spiked 
eigenvalues are bounded, i.e., Co < A j < C'o, j > m for constants cq,Cq > 0. In addition, the 
non-spiked eigenvalue average (p — m)~ l J2 ^= m +1 Aj = c + o(l). 


Assumption |2 .1 1 requires the eigenvalues be divided into the diverging and bounded ones. 
For simplicity, we only consider distinguishable eigenvalues (multiplicity 1) for the largest m 
eigenvalues. This assumption is typically satisfied by the factor model (1.1) with pervasive 
factors. More specifically, if the factor loadings {bj}^ =1 (the transpose of the rows of B) are 
an i.i.d. sample from a population with finite second moments, then by the strong law of 
large numbers, p^B'B = p _1 1 b y b) —* £& almost surely, where £& = E(b,-b'-). In other 

words, the eigenvalues of BB' are approximately 


pAi(Sb)(l + o(l)), • • • ,pA m (E&)(l + o(l)), 0, • • • ,0, 

where Aj(£b) is the jth eigenvalue of S/,. If we further assume that ||X U || is bounded, by 
Weyl’s theorem, we conclude 


A j = pAj(S b )(l + o(l)), for j = 1, • • • , m, 
and the remaining are bounded. 


( 2 . 1 ) 


2.2 A generic POET procedure for covariance estimation 


We see from (1.5) that the population covariance of the factor model (1.1) exhibits a 
low-rank plus sparse structure, if is sparse, whose sparsity level is measured by 


rrip := max ^ W Utij \ q 
j<p 

for some q G [0,1] is small. I 11 particular, with q — 0, m p corresponds to the maximum 
number of nonzero elements in each row of 
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To estimate the covariance matrix X with the approximate factor structure (1.5), Fan 


et al. (2013) proposed the POET method to recover the factor matrix as well as the factor 


loadings. The idea is to first decompose the sample covariance matrix into the spike and 
non-spike parts, 

^ n m 

n 


s E ^ > 

*=i 


( 2 . 2 ) 


2—1 


where X u = Y^= m +i ca H ec l the principal orthogonal complement. Then by em¬ 


ploying adaptive thresholding on X u to get X M (Cai and Liu, 2011|), they obtain a final 
covariance estimator X defined as 




(2.3) 


3 = 1 


The above procedure can be equivalently viewed as a least-squares approach. That is, the 
factor and loading matrices can be estimated by solving the following nonconvex minimiza¬ 
tion problem: 

(B, F) = arg min || Y — BF 7 ||p s.t. —F 7 F = I m , B 7 B is diagonal. (2.4) 

B,F 77/ 


ft is shown that the columns of F/ s/n are the eigenvectors corresponding to the m largest 
eigenvalues of the n x n matrix n _1 Y 7 Y and B = n -1 YF. Note that the estimator B 


given by minimizing (2.4), after normalization, is actually the first m empirical eigenvectors 
of n _1 YY'. Given B.F, we define U = Y — BF 7 and X M = n _1 UU 7 . Finally adaptive 
thresholding is applied to X, u to obtain X u = (<Ju,ij)pxp wifi 1 


^ u, ij 


Ou,ij •> ^ J 

Sij(Vu,ij)l(\Vu,ij\ ^ , iy^ j 


(2.5) 


where sp(-) is the generalized shrinkage function (Antoniadis and Fan, 2001; Rothman et al. 


2009) and = T(a u ^ i cr u jj) 1 ^ 2 is an entry-dependent threshold. The above adaptive thresh¬ 
old operator corresponds to applying thresholding with parameter r to the correlation matrix 
of X M . The positive parameter r will be determined based on theoretical analysis. 

Let w n = y/logp/n + l/y/p. Fan et al. (2013) claimed that under some technical assump¬ 


tions, with r x w n , if m p w * q = o(l), 


X; - XJ 2 = 0 P (m p w l - q ) = H(XI)- 1 - X 


~ 1 112 


( 2 . 6 ) 
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T 


= 0 , 


.(^ +m y r ), 


(2.7) 


\-l 


E 1 || 2 = 0 P [m n w l q 


where ||A|| S = p-^pT^ 2 AE“ 1/2 || p is the relative Frobenius norm. The scaling p 1 / 2 


is exploited to ensure ||S||^ 


= 1 . 


The term 1/^/p in w n is the price we need to pay 


for estimating the unknown factors. But in the high dimensional regime p > n so that 
1 /y/p < \J\ogp/n , the rate is optimal. The original proofs for getting the above rates are 
mathematically involved and is not clear why the optimal rates can be attained, especially 
when no sparsity assumption for eigenvectors was imposed as in sparse PCA literature. 

We propose a generic POET procedure here: (1) given three initial pilot estimators 
E, A, r for true covariance matrix E, leading eigenvalues A = diag(Ai,..., A m ) and leading 
eigenvectors T pxm = (£ l5 ... , £ m ) respectively, the principal orthogonal complement E u can 
be computed by subtracting out the leading low-rank part, i.e., 

E„ = E - TAT . 


T 


(2) The adaptive thresholding (2.5) is applied to E u to obtain E u , and (3) the low-rank struc¬ 


ture is added back to obtain E . Note for sub-Gaussian distributions, A = diag(Ai,..., A m ) 
is the diagonal matrix constructed by the first m leading empirical eigenvalues of the sample 
covariance matrix E while T = (£ 1; ..., £ m ) is the matrix of corresponding leading empirical 
eigenvectors. But in general, A and T do not have to come from the sample covariance 
matrix. In fact, they can even be separately estimated. 

So our question is: why such a simple POET procedure works under the piked covariance 


assumption (2.1)? Can we replace the sample covariance matrix by other pilot estimators 


as a starting point for the eigen-strucuture if other family of distributions, such as elliptical 
distributions or other more general heavy-tailed distributions, are considered? 


2.3 A high level theoretical interface 

A high level explanation is provided to understand the generic POET procedure. Suffi¬ 
cient conditions are brought up for and E to achieve the desired rates of convergence 


in (2.6) and (2.7). Our vital conclusion is stated in the following theorem. 


Theorem 2.1. Under Assumptions 2.1, if 3 C > 0 such that ||B|| max < C and C 1 < 


|E u || 2 < C. If we have estimators E.T, A satisfying (1.6), then the rates of convergence in 










(2.6) and (2.1) hold with the generic POET procedure described in Section 2.2 


The proof given in the appendix to obtain (A.l) provides insights on how the generic 


POET procedure works. Note that the max norm of low rank matrix estimation is bounded 
by Ai and A 2 . The former quantifies the estimation error of leading empirical eigen-structure 
TAT for its population counterpart, while the latter measures the error of identifying the 
low rank matrix BB 7 by TAT 7 from the true matrix X. The identification of low-rank 
and sparse matrices under pervasive condition is asymptotically unique with identification 
error Ao = 0(l/^/p). Additionally, the estimation contributes an error term of order Ai = 
0 P (y/\ogp/n). 


2.4 Conditional graphical model 


In Section 2.2l m p measures the sparsity of E u , but its inverse fl u = S ^ 1 is not nec¬ 


essarily sparse. Sometimes, the sparsity structure on reveals more interesting structure 
than Ti u . For example, If u t ~ EC P ( 0, E u , £), the sparsity of Q u encodes the conditional 
uncorrelatedness relationships between all variables in the p dimensional vector u t . More 
specifically, for p nodes ui,... ,u p , each corresponding to one element of u t , Ui and Uj are 
connected if and only if (O u )ij 7 ^ 0 , meaning that u lt and Uj t are uncorrelated conditioning 
on all the other {ukt}k^i,j and f f . If the number of factors is zero, this reduces to the classi¬ 


cal elliptical graphical model, exhaustively studied by Vogel and Fried (2011) and Liu et al. 


( 2012 ). 


In many applications, the conditional graphical model (or conditional sparse inverse co- 
variance model) appears more natural compared to the conditional sparse covariance model. 
For example, in understanding the dependence of financial returns, the interest lies in the 
graphical model of the idiosyncratic components after taking the common market risk factors 
away; in genomic studies, the graphs after taking the confounding factors such as age and 
environment exposure are of better interest. The factors can be interpreted as covariates 
that need to be adjusted before focusing on the analysis of correlatedness of the residual 


part (Fan et al. 2011). Cai et al. (2012) adopted the same idea of adjusting the factors in 


genomics application, but they do not assume the factors are pervasive so that they need to 
impose the constraint of a sparse factor loading matrix B. The sparsity was put on il „ and 
measured by the quantity 

M p : = max > | u U ij\ q . 

i<p ^^ 

j<p 

The generic POET procedure could also be modified to estimate conditional graphical 

~ ~ ~ / 

model. The first step is still recovering = S — TAT by removing the effect of low- 
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rank dominating factors. Then the method “constrained -minimization for inverse matrix 
estimation” (CLIME) proposed by 


Cai et al. 


(2011) can be applied to obtain fi u . Specifically, 


CLIME solves the following constrained minimization problem: 


fi u = argmin^ subject to 


l^ufi 11| max 


where | 121 


< r, (2.8) 

n . A further 


.11,1 = Ei Ej \u)ij\ and r is a tuning parameter so that t x w 
symmetrization step can be carried out to guarantee a symmetric estimator 12 „ = (Cj u .ij) 
where 

&u,ij = ■ ( 2 -9) 


Note that the optimization in (2.8) can be solved column by column using linear program¬ 


ming. Other possible methods can also be considered including graphical Lasso, graphical 


SCAD, graphical Dantzig selector, and graphical neighborhood selection (Friedman et al. 


2008 

Yuan and Lin| 2007 Fan et al. 2009 Lam and Fan, 2009 

Ravikumar et ah, 

2011 

Yuan 

2010; 

Meinshausen and Biihlmann 

2006 

). Though substantial amount of efforts have 


been made to understand the graphical model, little has been done for estimating conditional 
graphical model, which is again more general and realistic. 

Once we have 12 the original inverse covariance matrix 12 = 5R 1 can also be estimated 
using the Sherman-Morrison-Woodbury formula as follows: 


« - n„ - n„f (A 1 + r'n„f)- 1 r n 


( 2 . 10 ) 


The following theorem gives the rates of convergence for fi„ and 12 provided good pilot 
estimators X, A and T are given. Its proof is in Appendix [aJ 


< C and C 1 < 


Theorem 2.2. Under Assumptions 2.1, if 3 C > 0 such that ||B |, 

||12,J 2 < || 12 u 11 oo < C and the estimators S,T, A satisfy conditions nlM). Then the generic 
POET procedure with CLIME gives 


^2u fiu max — Op(w n ) — 12 12 max j . . 

A (2.11) 

||fiu — fin ||2 = Op{M p w] l q ) = || 12 — fi|| 2 ■ 

Note the assumption of bounded ||fi u ||oo is stronger than the case of estimating covariance 
matrix. This condition might be relaxed if other methods instead of CLIME was applied. 
But we do not pursue the weakest possible conditions here. Many potential applications are 
only involved with the estimation of inverse covariance matrix 12, for instance classification 
and discriminant analyses and optimal portfolio allocation in finance. 


10 

































































2.5 Positive semi-definite projection under max norm 

There is an additional issue that requires careful consideration. In the generic POET 
procedure, if T and A are not estimated from the same positive semi-definite (PSD) matrix 
£, the residual £ M may not be PSD for a given sample. Thus, the following optimization 
should be considered to find the nearest PSD matrix of £ u in terms of the max norm: 


T, u = argmin s ^ 0 ||£„ - £ 


u max ■ 


( 2 . 12 ) 


The minimizer preserves the max norm error bound since 


|£ U £u||max ^ II £u £«||max 


|£« ^dlmax — 2||£« £u||maxj 


and everything else in the POET procedure works with £ u replaced by £ u . The same 
problem occurs in conditional graphical model estimation. Although Cl u is PSD with high 
probability, in practice we may reach a non-PSD estimator for fi u . So we need to explicitly 


perform the PSD projection of fi u onto the PSD cone as in (2.12). 


Minimization (2.12) is challenging due to its non-smoothness. An effective smooth sur¬ 


rogate for the max norm objective was proposed by Zhao et ah (2014) which can be solved 
efficiently. Specifically, they considered minimizing ||£ t 


climax subject to £ u >z 0 where 


l A llmax = max (U, A) - 7 ^||U||| , 
||u||i i<i 1 2 


where HUjlip = Yh 3 \ u ij\- More details can be found in Zhao et al. (2014). Another possi¬ 


bility to ease computation burden is solving the dual problem of graphical lasso, that is, 


maxlogdet(W) subject to ||W 


£ 11 max ^ • 


By choosing r x w n , the optimal solution is a PSD matrix satisfying the max norm bound. 
Such a projection is still valid for the generic POET procedure to get the desired convergence 
rates under max norm. 


3 Sub-Gaussian factor models 


We have established sufficient conditions in (1.6) for optimal estimation of covariance 


matrices as well as conditional graphical models. The next natural question is whether these 
conditions hold for sub-Gaussian factor models. In this subsection, we validate the conditions 
for sample covariance matrix under sub-Gaussian conditions. 
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By the spectral decomposition, £ = r p A p r) ( where A p = diag(Ai,..., A p ) and T p is 
constructed by all the corresponding eigenvectors of £. We use subscript p to explicitly 
denote the dependence of A p and T p on all eigenvalues or eigenvectors rather than just spiked 
ones. Let x* = r) ( y, ; . So Xj has mean zero and diagonal covariance matrix A p . Since under 
orthonormal transformations of the data, the empirical eigenvalues of sample covariance are 
invariant and the empirical eigenvectors are equivariant, the analysis will be done on x,;’s 
which naturally extends to our original data yds by a simple affine transformation. The 
following assumption on x, is imposed. 

Assumption 3.1 (Sub-Gaussian distribution). Let Zj = A p 1//2 Xj be the standardized ver¬ 
sion of the transformed data x,;. z,’s are iid samples of sub-Gaussian isotropic random 
vector z, i.e., ||z||<^ 2 = sup u&SP -i || (z, u) ||< M for some constant M > 0 where the sub- 
Gaussian norm is defined as ||(z,u)|| (f >2 = sup p >ip 1/2 (E| (z, u)|p) 1 /p. Furthermore, we as¬ 
sume 3M i , M 2 > 0 such that for 0 < 9 < Mi, 


E 


exp 




< exp (M 2 9 2 p). 


(3.1) 


The above lemma require a slightly stronger condition than the classical sub-Gaussian 

in 


condition for z. It has to satisfy (3.1) for technical reasons discussed in Lemma D.2 


Appendix [D] This assumption is clearly satisfied if z has independent elements of sub- 


Gaussian variables (Vershynin 2010) although it could also hold for weakly dependent sub- 
Gaussian vectors. 


Under this assumption, trivially the first condition in (1.6) holds for the sample covariance 
matrix £y of y*, i.e., ||£y — £|| ma x = Op(y/\ogpjn). We present two theoretical properties 
next respectively on leading empirical eigenvalues {A j}fLi and eigenvectors {^ y }j =i of the 
sample covariance matrix £ of xds. These properties are useful for us to verify the remaining 


conditions of the high level theoretical interface described in (1.6). 


Theorem 3.1. Under Assumptions 2.1 and 3.1. for j <m we have 


IVAj —1| =0 P {n~ 1 ' 2 ), 


where Xj = Aj(£) is the jth largest eigenvalue o/£. 

Consider the empirical eigenvectors of £ for j < m. Each is divided into two parts 
^jb)' i where $, ]A is of length m corresponding to the spike component and 
corresponds to the noise component. 
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Theorem 3.2. Under Assumptions 2.1 and 3.1. for j <m we have 
(*) II^A — e^H = Op(n 1//2 ), where ej A is unit vector of length m; 


(a) ||ne,- 


jB l 


= O p (^\ogp/(np)) for any f2 px (p-m) s.t. Ut'Ul = I p _ 


p—m • 


The theorems state that under the pervasive condition that spiked eigenvalues are of 
order p, we are able to approximately recover the true leading eigenvalues and eigenvectors. 


In Fan and Wang (2015), the same phenomenon is observed when zfs are sub-Gaussian 
vector with independent elements. But here we do not require element-wise independence 


and relax the condition to any sub-Gaussian isotropic random vectors satisfying (3.1). The 
proofs of the above two theorems can be found in Appendix [Bj 

Given the above two theorems, let us validate the second and third conditions in (1.6). 
Define A sg = diag(Ai,..., X m ) where SG is short for sub-Gaussian. The second condition 
holds for A S g according to Theorems 3.1. Note that Sy and E share the same set of 

be the matrix 


empirical eigenvalues. To check the third one, let TfiG = (li , ■ ■ ■ , ) 


consists of the top m leading eigenvectors of Ey. If the whole eigen-space of E is written as 

*(Y) - ~(y) 

F P = (r, 17), then = r jj = F£ jA + Q£ jB . Therefore ^ ^ = T(£ jA - e jA ) + Q£ jB 

and 


i(X) 


SG 


11 max — max ||^. q 11 max 

j 

< max (\Z^||r|| max ||e M - e^|| + 
which is Op (-y/logp/ (np)) due to Theorem 


\si£ 


jB I 


(3.2) 


Theorem 


2.1 


3.2 


and the fact IF 


= 0 ( 1 /^) shown 


m 


Hence, we have shown that the sample covariance based estimators Ey, Asg 

this explains 


2.1 


and r 5G satisfy the sufficient conditions (1.6). Together with Theorem 
why POET achieves all the desired rates (2.6) and (2.7). 

We finally devote a remark to the assumption of zero mean of the observed data implied 


by Assumption 3H This condition is only made to simplify the presentation of proofs. In 
practice, we first center the data by y = ?r _1 JAy*. All the conclusions of this section hold 
for the centered data as well. 


4 Elliptical factor models 


In the previous section, we assume x* to be a sub-Gaussian random vector, which is 
a strong distributional assumption for many applications. In this section, we replace the 


sub-Gaussian assumption 3.1 by elliptical distribution assumption 4.1 and propose a novel 
robust estimator for the analysis of factor models. 

We first briefly review the elliptical distribution family, which generalize the multivariate 
normal distribution and multivariate t-distribution. Compared to the sub-Gaussian setting, 
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it is more challenging to design pilot estimators to simultaneously satisfy the three require¬ 
ments in (1.6). To handle this challenge, we separately construct two estimators Si and 
S 2 . Si and its leading eigenvalues satisfies the first two requirements in ( 1 . 6 ) while the 
eigenvectors of S 2 satisfies the last condition of (1.6). 


4.1 Elliptical distribution 

We define the elliptical distribution as follows. Let /1 E and S E M. pxp with rank(S) = 
q < p. A p-dimensional random vector y has an elliptical distribution, denoted by y ~ 
EC p (n, S, £), if it has a stochastic representation 

y = /i + CAU, (4.1) 


where U is a uniform random vector on the unit sphere in M 9 , £ > 0 is a scalar random 
variable independent of U, A E W° xg is a deterministic matrix satisfying A A' = X. Here 


X is called the scatter matrix. Note that the representation in (4.1) is not identifiable 
since we can rescale ( and A. To make the model identifiable, we require E () 2 = q so that 
Cov(Y) = X. In addition, we assume X is non-singular, i.e., q—p . If q < p, as long as they 
are of the same order, all results in the following still hold. In this paper, we only consider 
continuous elliptical distributions with P(£ = 0) = 0. 

An equivalent definition of an elliptical distribution is through its characteristic function, 
which admits the form exp(it / / u)'^(t / Xt), where is a properly defined characteristic function 
and i := yf—l. ( an d if are mutually determined by each other. In this setting, we denote by 
y ~ EC p (/j,,'E,ip)- The marginal and conditional distributions of an elliptical distribution 
are also elliptical. Therefore, in factor model (1.1), if f t and u t are uncorrelated and jointly 
elliptical, i.e., (f(, u()' ~ EC P ( 0, diag(I m , X u ), £), then we have y t ~ EC P ( 0, X, (). 

Compared to the Gaussian family, the elliptical family provides more flexibility in model¬ 
ing complex data. The main advantage of the elliptical family is its ability to model heavy-tail 


data and the tail dependence between variables (Hult and Lindskog, 2002), which makes it 


useful for modeling many modern datasets, including financial data (Rachev 2003 Cizek 


et al. 

2005 

), genomics data ( 

Liu et al. 

2003 

Posekany et al. 

2011 

), and fMRI brain-imaging 

data 

Ruttimann et al. 

1998 

)• 




The following assumption is considered in this section. 


Assumption 4.1 (Elliptical distribution). The data y i’s are elliptically distributed, i.e., 
Yi ~ EC p (n, X, () or Yi = /i + ^XsU, with Uj uniformly distributed on the unit sphere 5 P_1 
and the random variable (i > 0 independent from Uj. Additionally, we assume E[Cf] = P 
due to identifiability and maxj< p Ei/L is bounded. 
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The above assumption is implied by imposing a joint elliptical model of the factors and 
noises, i.e. , (f/, uj)' ~ EC P ( 0 , diag(I m , £ u ), (). Obviously, the elliptical family is more general 
than Gaussian assumption and contains heavy tail distributions. One typical example is 
multivariate t-distribution with degrees of freedom v > 4. The moment condition is imposed 
only for the sake of estimating marginal variances by methods discussed in Section |4.2| This 
assumption may be relaxed if other methods are applied. 


4.2 Robust estimation of variances 

Let £ = DRD where R is the correlation matrix and D = diag(cxi,..., cr p ) is the diagonal 
matrix consists of standard deviations for each dimension. Our construction of Si is based 
on separately estimating D and R. In this subsection, we first introduce a robust estimator 
D to estimate D. 

Since y* ~ EC p (fi, X,C) exhibits heavy tails, we need a method to robustly estimate 
/i in order to center the data and estimate the covariance matrix. Substantial amount of 


research has been conducted on this subject in both low dimensional setting (Huber, 1964 


Zou and Yuan, 2008; Wu and Liu 2009) and high dimensional setting (Bclloni et al. 2011 


Zou and Yuan, 

2008 

Fan et al., 

2014a 

)• 


In addition, Koenker (2005) has considered problem from a quantile 


regression perspective. In this section, we introduce two M-estimator methods proposed by 


Fan et al. (2014b) and Catoni (2012), who borrow the original idea from Huber (1964). The 


methods are also useful for robust estimation of variances. 

Let us denote /i = (pi, ..., p p )' and yj = (yn, ..., y ip )' for i — 1,..., n. We estimate each 
fij using the data {yij, ..., y n j}- The M-estimator jl = (jj q,..., p, p )' of Fan et al. (2014b) is 
obtained by solving 

n 

J2 h [ a (yij - h)] = o ( 4 - 2 ) 


i =1 


for each j < p, where h : M —> M is the derivative function of the Huber loss satisfying 
h{x) = x if |x| < 1, h(x) — 1 if x > 1 and h{x) = — 1 if x < —1. The above estimator can 
be equivalently obtained by minimizing the Huber loss 


£ a (x) = 


2a- 1 


x 


\x — a 2 

: x 


: a; 


, -i. 


„-i 


According to 


Fan et al. 


(2014b), choosing a = ^/log(e _1 )/(nu 2 ) for e G (0,1) such that 
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log(e x ) < nj 8, where v is an upper bound of max{af,..., cr 2 }, we have 


p (lAj - h \ - 


> 1 - 2e . 


(4.3) 


Catoni (2012) proposed another M-estimator by solving (4.2) with a different strictly 
increasing h(x) such that — log(l — x + x 2 /2) < h(x ) < log(l + x + x 2 /2). For a value 
e G (0,1) such that n > 21og(l/e), let 


a = 


2 log(e -1 ) 


\ n t v + 2i;lo g( e b ) ’ 

\ ''W + n—2 log(e - !) / 


where v is again an upper bound of max{ erf, 


, a 2 }. Catoni (2012) showed that the solution 


of (4.2) satisfies 


f(|Aj - i*j\ < 


2v log(e 


-i' 


n 


21og(e 


-n / - 


> 1 - 2e. 


(4.4) 


Therefore, by taking e = 1 /{n V p) 2 , \fi — /i|oo < C^/logp/n with probability at least 
1—2(nVp) _1 for both methods. We implement Catoni’s estimator in the simulation by taking 
h(x) = sgn(x) log(l + |x| + x 2 /2). For the choice of v, we simply take v = maxjbf,..., cr 2 }, 
where a 2 are the sample covariance of the jth dimension. 

To estimate a 2 , we apply the above M-estimation methods on the squared data. Note 
that a 2 = E (Y 2 -) — p 2 . We have estimated p 3 above. To estimate E (Y 2 -), we employ the M- 


estimator (4.2) on the squared data {y\j, ■ ■ ■, y 2 3 }, denoted by fj 3 . This works as the fourth 


moment of ]ji 3 is assumed finite. The robust variance estimator is then defined as 


a 2 = max{% - p 2 , 5 0 } , (4.5) 

where ho > 0 is a small constant (ho < minjerf,..., a 2 }). If n > Clogd, let D = 
diag(<Ti,..., d p ), we have 

||D — D|| = 0 P (\J\ogp/n). (4.6) 

Additionaly, due to the structure of £ = B ; B + E u , where ||S U || < C and ||B|| max < C, it 
is easy to see ||D|| = 0(1) and ||D II - Op(l). 


4.3 Marginal Kendall’s tau estimator 

We now provide a pilot estimator to robustly estimate the correlation matrix R = (r jfc ) 
when data follow elliptical distributions. The idea of Kendall’s tau statistic was introduced 
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by Kendall (1948) for estimating pairwise comovement correlation. Kendall’s tau correlation 
coefficient is defined as 

Tjk 


n(n 


-^sgnUry-y^MYk-m)), 

i<i' 


(4.7) 


whose population counterpart is 


Tjk := P((y y - Y 2j )(Y lk - Y 2k ) > 0) - P((Y li - Y 2j )(Y lk - Y 2k ) < 0). 


(4.8) 


Note that the estimator does not depend on the location /i. So without loss of generality, 
we assume /i = 0. Then y EC P ( 0, £, () with independent and identically distributed 
samples yi, ... ,y„. 

Denote by T = (rj k ) and T = (fj k ). For elliptical family, it is known that the nonlinear 


relationship r jk = sin(| Tj k ) holds for the Pearson correlation and Kendall’s correlation (Fang 


et al. 

1990 

Han and Liu 

2014) 


2014). Therefore, a natural estimator for R is R = (fjk) where 

7T „ 


T jk = sm ( ~T jk 


(4.9) 


By Theorem 3.2 of Han and Liu (2013b), with probability larger than 1 — 2e — e 2 for any 
e G (0,1), 


|R — R||2 

< 7T 2 IIR| 


2x l ( tr R/llR|| 2 + l)log(p/e) | (trR/||R|| 2 + l)log(p/e) 


3 n 


n 


Using the fact ||D|| 2 ||S|| < ||R|| < ||D 1 || 2 ||S||, we know ||R|| x ||S|| x p since all the 
eigenvalues of D are bounded away from infinity and zero. This is true because A m i n (D 2 ) > 


A min (S) > c 0 and ||D|| =0(1) as derived in Section 4.2. This implies 

||R - RIB = Op| 


p 2 logp\ 


n 




(4,10) 


Combining the rates in (4.6) and (4.10), we conclude 


Aj(DRD) - Aj(DRD) < ||DRD - DRD| 
= Op(||(D — D)RD|| + ||D(R — R)D||) 

- 0p (/dhU). 
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Define Si = DRD. The estimator A ED = diag(A 1 (S 1 ),..., A m (Si)), which consists the 
first m eigenvalues of Si, satisfies 

||(A ed - A)A-'|| = ■ (4.H) 

Here ED is short for elliptical distribution. This makes the second sufficient condition in 
hold. Furthermore, we can easily check that the first sufficient condition holds for Si 
using the concentration of U-statistics, i.e. 



|Si-S| 


- Opfviogp/n) . 


(4.12) 


Although the marginal Kendall’s tau based estimator Si has good properties for eigen¬ 


values, it is hard to prove the third sufficient condition for eigenvectors in (1.6) due to the 


complicated nonlinear sin(-) transformation. Luckily, we do not require T and A in (1.6) to 


come from the same covariance estimator. In the next section, we propose another covariance 


estimator S 2 whose eigenvectors satisfy the third sufficient condition in (1.6). 


4.4 Multivariate Kendall’s tau estimator 


To find an estimator T FjD that satisfies the third condition in (1.6), we resort to the 


multivariate Kendall’s tau estimator. We focus our analysis again on the transformed data 
x, = T^y. The population multivariate Kendall’s tau matrix is defined as 


K := E 


(xi - x 2 )(xi - x 2 y 


(4.13) 


||X! - X 2 II 2 

The sample version of the multivariate Kendall’s tau estimator is a second-order U-statsitc: 


K : = 




n(n — 1) 


(4,14) 


%<v 


where 




(Xj Xj/)(Xj Xj/) / 


X.; - Xi/ 


v 112 


Several important properties of the above estimator is worth mentioning. First this es¬ 
timator is location invariant, which allows us to assume /1 = 0 without generality, i.e., 

j 1 „ 

x, = CjApUj. Secondly, the eigenvectors of the estimator K is equivariant to orthogonal 
transformation. So if we define the multivariate Kendall’s tau estimator based on the ob- 


18 













served data y r as 


So = 


n(n — 1) ^ 

y ' i<i' 


^fc(yi,yi') = r P Kr;, 


(Y) a. ^ (y) 

we have £• = r p £-, where £■ and £■ are the j th empirical eigenvector of K and S 2 , 

respectively. 


The most important feature of the U-statistic estimator in (4.14) is that its kernel 
fc(xj,Xj/) is distribution-free. To see this, we have 

X - X = ca|u - CA|U = CA|U, 


where X is an independent copy of X and the characteristic function of ( is determined by 


that of (. See Hult and Lindskog (2002) for the detailed expression of the characteristic 
function. Thus, 


Jfc(X,X) = 


(x-x)(x-xy 


MUU'A 3 


p d 


A P 2 gg'A 


IX 


XIII 


U'ApU 


g'Apg 


which depends only on the multivariate standard normal vector g. The last equality is due 


to U = g/1|g||. Thus K defined by (4.13) is a diagonal matrix by the symmetry of g. 
Write K = diag( 6 ) 1 ,..., 0 p ), where d 3 is defined as 


6 3 =e( 




VELi x k9, 


which is a multiple of A j. Obviously, K shares the same eigenvalue ordering as that of 
Cov(x ? ;) = A p , and thus the same eigenspaces as those of Cov(xj). So estimating the leading 
eigenvectors of Cov(xj) is equivalent to estimating those of K. In sum, S 2 particularly fits 
the goal of estimating the eigenvectors of X. 

The above multivariate Kendall’s tau statistic is first introduced in IChoi and Mardenl 


(1998) and has been used for low dimensional covariance estimation (Visuri et ah, 2000) and 


principal component estimation (Marden, 1999; Croux et ah, 2002). Many testing literature 


based on rank statistics is also related to the estimator, for example Tyler (1982); Hallin and 


Paindaveine (2006). The literature listed here is only illustrative rather than complete. 


We now consider the theoretical properties of the eigenvectors of K. As before, £ ■ is 


divided into the spiked part £j A and noise part £ 


jB- 


Theorem 4.1. Under Assumptions 2.1 and 4-1, for j <m we have 
(i) ||£ — e^H = Op(n -1 / 2 ), where ej A is a unit vector of length m; 


19 







































(%%) ||O^.J ma x = Op (A/log p/(np)) for any ft pX (p_ m) s.t. Tl'Tl = I p _ m . 
The proof for Theorem 


4.1 


is 


4.1 


relegated to Appendix [cj Define S 2 as the multivariate 
erved c 
implies 


- (Y) - (Y) 

Kendall’s tau estimator of the observed data yds and T ED = (£ x ,..., ) as the leading 

eigenvectors of X 2 . Theorem 


ED 


= 0 P (y/\ogp/(np)) 


following the same argument (3.2) in Section [3] So the third sufficient condition in (1.6) 
holds for Ted ■ Together with the estimators Si and A E d defined in Section 


4.3 


we are 


ready to apply the general POET procedure for the heavy tail factor model and achieve all 
the desired estimation convergence rates for both covariance and precision matrices. 


5 Simulations 

Simulations are carried out in this section to demonstrate the effectiveness of the proposed 
method for elliptical factor models. The robust estimators S x , A ED , T ED proposed in Section 
[4] will be compared with the original POET estimator based on the sample covariance, or 
£ y, A sc, f\sG- discussed in Section [ij We put the two sets of estimators into the general 
POET framework described in Section [2] for estimating both conditional sparsity covariance 
and conditional graphical models. 


5.1 Conditional sparse covariance estimation 


In this section, we consider the factor model (1.1) with (f t , u t ) jointly follow a multivariate 
t-distribution with degrees of freedom v. Larger v corresponds to lighter tail and v = 00 
corresponds to a multivariate normal distribution. We simulated n independent samples of 
(ft, Ut) from multivariate t-distribution with covariance matrix diag(I, n ,I p ) and each row of 
B from A/"(0, I m ). The observed data is formed as y t = Bf t + and the true covariance is 
X = BB ; + I p . We vary p from 100 to 1000 with sample size n = p/2, and fixed number of 
factors rn — 3 in this simulation. 

For each triple ( p,n,m ), both the original POET estimator (Sy, A 5G , T^g) and the 
proposed robust POET estimator (Si, A EE , T EE ) were employed to estimate £ u and S. 100 
simulations were conducted for each case. The log-ratio (base 2) of the average estimation 
errors using the two methods were reported in Figure [lj measured under different norms 


T 


S; IIS-S 


T '^ 1 -S^lhforE^ ||S T —S|| ma .x,llS T 


1 2 and || (£ u ) 

| ma x, || (A —A)A _1 || 2 and \\t-T 


S _1 || 2 for 


S||j; and || (S 
for initial pilot estimators). In addition, three 
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different degrees of freedom v = 4.2, v — 7, v — oo were chosen, representing respectively 
heavy tail, moderate heavy tail, and normal situations. 

From Figure [lj when factors and noises are heavy-tailed from £ 4.2 (black), the original 
POET estimators are poorly behaved while the robust method works well as we expected. 
tj (blue) typically fits financial or biological data better than normal in practice. In this 
case, we also observe a significant advantage of the robust POET estimators. The error is 
roughly reduced by a magnitude of two if the rank based estimation is applied. However, 
when the distribution is indeed normal or t ^ (orange), the original POET estimators based 
on sub-Gaussian data performs better, though the robust POET also achieves comparable 
performance. 


5.2 Conditional graphical model estimation 


In this section, we consider the conditional graphical model described in Section 2 A. In 
particular, we compare the accuracy of different methods for estimating the precision matri¬ 
ces fl u and fl. Here we assume a block diagonal precision error matrix fl u = diag(M,..., M) 
where M is 2 by 2 correlation matrix with off-diagonal element equals 0.5. Then we simu¬ 
late (f t , u 4 ) again from multivariate t-distribution with covariance diag(I m , ft)) 1 ). We set the 
dimension p to range from 50 to 500, sample size n = 0.6 p and a fixed number of factors 
m = 3. 

For each configuration of (p, n, m), after applying POET with the original and robust pilot 
estimators, we estimate fl u and fl as proposed in Section 


2.4 


using the CLIME procedure. 
To efficiently solve large-scale CLIME optimization (]2.8|), we used the R package “fastclime” 


developed by Pang et al. (2014). 100 simulations were conducted for each case. The log-ratio 
(base 2) of the average errors of the two methods were reported in Figure [2j measured under 
spectral norms |||| 2 and ||f2 — f2|| 2 . Three different degrees of freedom v = 4.2 (black), 
v = 7 (blue), v = 00 (orange) were used as in Section 5.1 Clearly, the robust estimators 
outperform non-robust ones for £4.2 and £7, and maintains competitive for the normal case. 


6 Discussions 


We provide a fundamental understanding of high dimensional factor models under the 


pervasive condition. In particular, we extend the POET estimator in Fan et al. (2013) to be 


a generic procedure which could take any pilot covariance matrix estimators as initial inputs, 


as long as they satisfy a set of sufficient conditions specified in (1.6). Transparent theoretical 


results are then developed. The main challenge is to check the high level conditions hold 
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Figure 1: Conditional sparse covariance matrix estimation. The 8 plots corresponds to 
logarithms (base 2) of the ratios of average errors of the original and the robust POET esti¬ 
mators, measured in different norms. Data were generated from multivariate t-distribution 
with degree of freedom v = 4.2 (black), v — 7 (blue), v = oo (orange) with p from 100 to 
1000, n = p/2 and m = 3. 100 simulations were conducted for each p. 
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Figure 2: Conditional graphical model estimation. The plots corresponds to log ratio (base 2) 
of average errors of the original and the robust POET estimators for fl u and SI, measured in 
spectral norms. Data were generated from multivariate t-distribution with degree of freedom 
v = 4.2 (black), v — 7 (blue), v = oo (orange) with p from 50 to 500, n = 0.6 p and m — 3. 
100 simulations were conducted for each p. 


for certain estimators. When the observed data y, is sub-Gaussian random vector, we are 
able to simply use sample covariance matrix to construct initial estimators. However, if we 
encounter heavy-tailed elliptical distributions, robust estimators for eigen-structure should 
be considered. The paper provides an example of separately estimating leading eigenvalues 
and eigenvectors under elliptical factor models. But the results could possibly be generated 
to other richer family of distributions. 


Based on recent work of Fan and Wang (2015), it is possible to relax the spiked eigenvalue 
condition from order p to weaker signal level. But bounded eigenvalues are obviously not 
enough for consistent estimation as has been pointed out by Johnstone and Lu (2009), and 


as a result more structural assumptions are needed. Agarwal et al. (2012) considered a 
similar type of low-rank plus sparse decomposition, but their work is based on optimization 
technique and does not leverage pervasiveness. In addition, they only analyze the obtained 
estimator using Frobenius norm errors. Consequently, their lower bound results are not 


applicable to our setting. The lower bound for the rates in (2.6) and (2.7) will be pursued in 
a separate work. By all means, the similarity and difference between optimization thinking 
and pervasiveness thinking should be studied in further details. 


A Proofs in Section 2 


Proof of Theorem 12. 11 We establish (2.6) here and defer the details of the proof of (2.7) 
to Appendix |a}. To obtain the rates of convergence in (2.6), it suffices to prove ||S U — 


•‘u max 


= Op(w n ). Once the max error of sparse matrix is controlled, it is not hard 
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to show the adaptive procedure discussed in 

2.5 

) gives £,T 

llU - £„|| 2 = Op(m p w l n -) ( 

Fan et al. 

2011 

Cai and Liu, 


2011 


Rothman et al. 


error 


2009). 


K%- So IKsj^-s, 


-ii 


2 IS 


Furthermore, ||(S U ) 1 - || 2 < ||(E U ) Ibll^^ - U u ||2 

also Op(m p w^~ q ) due to the lower boundedness of ||S U || 2 . 

According to first condition in (1.6), ||S — £|| max = Op(y/logp/n). Therefore to show 
|| — S u || max = Op(w n ), we only need to prove the low rank part of £ concentrates at a 
desired rate under max norm. So our goal is to prove 

||rAr - BB X || max = Op{\J\ogp/n + 1 /y/p). 


(A.l) 


Let BB ; = TAr where A = diag(||bi 


| 2 ) and the jth column of T is b,-/llb 


To obtain (A.l), we bound Ai := ||rAr — TABUmax and A 2 := ||rAr — TAB 


3 / 

max S6p^- 


rately. Four useful rates of convergence are listed in the following: 


||A-A|| max < ||£ u || =0(1), 

||r - r|| max < cp^uW/p = o(i/p ), 

II(A - A)A _1 || max = Op (i/log p/n ), 
l|r - r|| max = Op (\/iogp/ (np)). 


The first one is due to Weyl’s inequality since ||A — A|| max = ||A — A || 2 while the second 
follows from trivial bound ||T — F|| max < ||F — T||p, which is further bounded by C||£ u ||/p 
according to the sin 9 theorem of Davis and Kalian (1970). The third and fourth rates are 
by assumption. Next we show ||r|| max = 0(1/y/p) and derive the rates for Ai and A 2 . 

Note that 


ITAs 


fAU 


max — 


< c 


BA 
B 


2 (A2 • 

max + 


A*' 


(r-f)A 


Vp 


= 0 ( 1 ). 


Since ||B|| max = ||fA 2 || max = 0(1), we have || FA 1/2 1| max = 0(1) and ||F|| max = 0(1/^). 
Using this fact, the following argument implies Ax = Op(^/logp/n) and A 2 = 0(\Jl/p). 
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More specifically, 


Ai < ||f (A - A)f || max + II (f - r)A(f - r) / || max + 2||rA(f - r) / || max 
= O p (p _1 ||A - A 11 max + y/p\\t ~ r|| max ) = Op(y/\ogp/n) , 
a 2 < ||f (A - A)f|| max - ||(r:-r)A(f - r)'|| max + 2||rA(f - r)'|| max 
= 0(p _1 ||A - A11 max + y/p\\T - r|| max ) = 0 (y/ljp). 


Combining the rates of Ai and A 2 , we prove (A.l). Thus (2.6) follows 


Now let us prove (2.7). The first result follows from 


lv — y ll <r 

\^u ^it 11 max _ 


\K - s 


u max 


+ 


IV _ V II 

^ it 11 max 


= Op(r + w n ) = Op(w n ) when r is chosen as the same order w n and 


|S T - SI 


max A IlfAf — BBdLax + IIS — S t 


= Op(w r 


We now prove the remaining two results. Suppose the SVD decomposition of S = r p A p r' ; 
where T p = (T. f2) and A p = diag(A, ©). Then obviously 


IS -S|| s <p" 1/2 


s^tat — bb ; )s 


T 


S-2(S U -S U )S- 


= : A l + A 5 , 


and 

V <P 

It is easy to show 


A* < p-^HS-'llllsI - SJ| P < CIISI - S tt || = 0 P {m p w]- q ^ 


© ^O' 


A L = P 

A A li + A L2 + 2A 13 


A- 2 r' 


TAT - BB') ( TA“2 f2©§ 


(A.2) 


(A.3) 


(A.4) 


where A L1 = ||A“5r'(f Ar -BB)rA ~2 \\ F /^/p, A L2 = ||©-50'(fAf -BB')O0-5 \\ F /^/p 
and A^ 3 = ||A"2r / (TAr — BB ; )S7©^5 Wp/^Jp. In order to characterize the rate of conver¬ 
gence under relative Frobenius norm, we analyze the terms A^i, A F2 and Ap^ separately. 


According to Theorem 4.1 of Fan and Wang (2015), A L1 < ||A 2 r'(fAr 

BB^rA^sl^ — 0 P (n _1 / 2 ). A L2 is bounded by 


P 


- 1/2 f||0"2OTAro0-2|| F + ||0-5n , fAf'n©“5| 


FI =:AS + Ag>, 
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where 


A2 < p- 1/2 ||0 _1 ||||n , (f - r)|||||A|| = 0 P (Vplogp/n), 

because ||IT — r|| max = 0 P {flogp/ ( np )) by assumption and ||A|| = 0 P (p). Similarly, A^ = 
Op(l/p 3 / 2 ) as \\n'r\\ F < y/m\\Q't|| 2 = v^lirr 7 - ff'|| 2 = 0{ ||E u ||/p) = Op{l/p) by the 


sind Theorem (Davis and Kalian, 1970). Finally, Ap 2 = 0 P ( y / P \ogp/n + 1/p 3 / 2 ). Following 
similar arguments, Ap 3 is dominated by A L1 and Ap 2 . Combining the terms A L1 , A i2 , Ap 3 
and A5 together, we complete the proof for the relative Frobenius norm. 

We now turn to analyze the spectral norm error of the inverse covariance matrix. By the 
Sherman-Morrison-Woodbury formula, we have 


T, 


(£ ) 


-1 


s- 1 !!, < 


.T, 


-1 


i-l 1 


+ Aj. 


(A.5) 


where 


Aim, = ||(sI)“ 1 rA I (I m + J)- 1 A 2 fitly 1 - S“ 1 fA^(I m + j) _1 A®f , E“ 1 ||, (A.6) 


where J = A 2 T (E u ) ^A 2 and J = A 2 T E u a f A 2 . The right hand side can be bounded 

A J * * 1 A 

by the terms representing the differences of the “hat” part (S u )^\ TA 2 , (I m + J) -1 and the 
“tilde” part (E^) _1 , fA*, (I m + J) -1 : 


(((EJ- 1 - E; x )rA 2 (i m + J)-MAr x:- 1 !! , 
IS-^f A^ - f A § )(I m + j)" 1 A l f , E- 1 ||, 
IE" 1 ?A f ((I m + j)" 1 - (I m + J)- 1 )A l f's; 1 | 


The first term is O P (rn p r wy q ): the second term is 0 P (p ~2 ||rA 2 — fA 2 ||) = 0 P (w n ); and 
the third term is Op(p||(I m + J) -1 — (I m + J) _1 ||) = Op(p _1 || J — J||) = Op(m p ^“ 9 ). Thus, 
A inv = 0 P (m p wy q ), which implies ||(E ) _1 — S _1 || 2 = 0 P {m p wy q ). Therefore, we finish 

d 


the proof of the remaining parts of (2.7) in Theorem 2.1 


J u max 


Proof of Theorem 12.21 From (A.l), we have ||E t 

uch that | 

so that the true fl is within the region of the 


= 0 P {w n ). Next we prove 
that the CLIME estimator will give f l u such that ||O u — O u || max = 0 P {w n ). Choose r > 


|^n||oo||^it ^'ullrnax d I||max 


constraint of (2.8). So 


|^u ^u||max A ||L2«(I E u O u )|| max + ||(I E u f2 u ) Li u ||max 


<lin 


u 11 00 


L u) Umax 

|I ^A^ullmax T || II oo ||I ^«^«||max , 
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since n u is a feasible solution of (|2.8|), and the 
1 


where the first term is bounded by r||0 
second term is bounded by rUfiulloo due to the optimality of Q u over Q u . Therefore with 
r x w n , we have ||f2, u — f2 u || max = Op(w n ). It is easy to see the symmetrization step does not 
change the rate of || Cl u — X} 

||n« - riuh = Op(M p w, 

From the proof of the spectral norm error of the inverse covariance matrix in Theorem 
we have ||0 — fi|| 2 = Op(M p wl^ q ). It remains to prove ||fi — f2|| max , which is by definition 


U u u "ii||max 

h-q'i 


. By similar arguments as in 


Cai et ah 


(2011), we obtain 


2.1 


bounded by 


o P (\\a n -n 


u max 


+ p lajhv 2 -o u rA2|| max 


+p 2 ||j - J|| max ), 


where J and J are defined in (A.6). After some calculations, its dominating term is Op(||h2 t 


f^u||max) — Op(w n ). 


□ 


B Proofs in Section 3 

Proof of Theorem 13. ll Define X = (x!,...,x p ) = (Z^A^ 2 , ZpA^ 2 ) where Z^ = 
(zi,...,z m), Zip = (z m+ i,... ,z p ) with z j = Xj/ -s/Xj and A a = diag(Ai,..., A m ), Ap = 
diag(A m+ i,..., A p ). In order to prove the theorem, we first define two auxiliary quantities as 
follows and analyze them separately. Let 

m p 

A = n~ l ^ XjZjz'j, and B = n 1 XjZjz'-. 

j =1 j=m +1 

In addition we define 

1 i p 

Snxn — “XX / = — XjZjZj = A + B, 

n n J 

3 =i 

which share the same nonzero eigenvalues with the sample covariance matrix X = n _ 1 X / X. 

We first prove A satisfy |Aj(A)/Aj — 1| = Op{n ~ 1 / 2 ). Note that A = n^Z^A^Z^ 
has the same eigenvalues as matrix A = n^Z'Z, where Z = Z^A ^ 2 is an n x m matrix 
with independent and identically distributed rows. Each row is a sub-Gaussian random 
vector with mean 0 and variance A^. Therefore, we are in the low dimensional situation 
with fixed dimension m, although eigenvalues diverge. By central limit theorem, the j th 
diagonal element of A is of order Aj(l + Op(n _1//2 )) and the (j, k) th off-diagonal elements are 
of order \J~XjX~kOp{;n~ 1 ^ 2 ). Therefore, Aj(A) = Aj(A) = A J (A l / 2 (I m + Op(n - 1 / 2 ))A^ 2 ) = 
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A,(Aa(I to + Op{n 1//2 ))) and furthermore 


AjA m i n (I m + Op(n 1 ^ 2 )) < Aj(A) < AjA max (I m + Op(n 1 ^ 2 )). 


Since dimension m is fixed, we have |Aj(A)/Aj — 1| = Op(n -1 / 2 ). 

Secondly, we have Afc(B)/Aj = op(n^ 1 / 2 ). By the definition of B, B = n^ZpApZ'^ 
where Zp is a n x (p — m ) random matrix with independent rows of zero mean and identity 
covariance and A B = diag(A m+ i, ■ • • , X p ). Since each row of Z B is independent sub-Gaussian 


isotropic vector of dimension p — m, by Lemma D.l choose t = -y/n, for any k < n, 


p — m 


Afc(ZpApZ' B ) —- A j + O 

p — m , 


j =771+1 


- ) = c + Op 

p 


(t® *•<■>■ 


Therefore, for k — 1,..., n, 

Afc(B) n\ k (B)p-m 


X< 


p — m nXj 


- 0p[ h+) - 0p{n ~ h) - 


By Wely’s Theorem, Aj(A) + A n (B) < A j < Aj(A) + Ai(B). Therefore, combining results 
for A and B, we conclude \Xj/Xj — 1| = Op{n ~ 1//2 ). □ 


Proof of Theorem 13.21 (i) We start by proving the rate for in the simple case of 
m — 1. In this case, by Lemma B. 1, we indeed have |£ ly i — 1| = Op(n -1 / 2 ) since p > n. 


In the following, we consider m > 1. Denote X = (Z^A^ 2 , ZpA]/ 2 ) as before. Recall 
£i,£ 2 are eigenvectors of X = n _1 X'X. Let ui,u 2 ,...,u n be the eigenvectors of 
X = nr 1 XX'. It is well known that for % — 1, 2,..., n, 


I,: = (nXi) i/2 X / Uj and ip = (rcAj) i/2 X$-, 


(B.l) 


Using (B.l), we have 


£jA 


Afz>, 


inXj 


and u, = 


+ Z B Af l jB 


nXj 


nXj 


nXj 


Since Uj is the eigenvector of X, that is, n 1 XX / u J - = XjUj. 
(Z^A^ 2 , ZpA^ 2 ), we obtain 


In - -Za^Z'a )uj = Du j - Au j , 

Ti /\ q 


(B.2) 


Plugging in X = 




















where we denote D = (nXj) 1 Z B A B Z' B , A = Xj/Xj — 1 . W e then left-multiply the above 
equation by X}[ 1 Z! A f ynXj and employ the relationship (B.2) to replace in,- by $,, )A and 


as follows: 


lm x ')Z.iA 


AfaZ' A Z A - I m )A f , A 1 / 2 Z / ,DZ 4 A 1/2 


1/2 


Xi 


t , —A & 

£ iA ^ c SjA 


n\ q 


Afz; 4 DZ B A/ 2 - . 

AjB - a 4/a 


nXj 


(B.3) 


Further, we define 


*= E 


A 


X j Xk 


e kA e kA 


ke[m]\j ' 0 

where R is well defined because m > 1. Then we have R(I m — A a /Xj) = I m — e^e'^. Left 


multiplying R to (B.3), we have 






R Afz k DZ^ ! 


nX-i 


where K = n ^ZI^a ~ I m + Xj(nXj) 1 Z' A DZ A - Dividing both sides by H^^H, we get 


s/A 


= R ( ) 2 K ( ) 2 e .M + r n , 


s/Al 


Aj 


(B.4) 


where 


r„. = 


L'A 




IjA 


*jA 


*jA I 


R 


aTz^DZbA 1 ^ d jB 


nXj 


- AR 


Da I 


s jA 


— e 


: /A 


sjAl 


Following Fan and Wang (2015), together with Lemma D.l, we can show ||r n || — op(n 2 ). 


(B.5) 


Further note that (Aa/X j) 2 GjA = ejA, we obtain 




jA 


\t, 


jA I 


A 


- e jA ) = \/nR( ‘Ke jj4 + o P (l) 


(B.6) 
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According to the definition of R, as p —> oo, 


n Ai \2 / 

J — / , ^ ^ e fcA e fcA ~^ / . , a jk e kA e kA 5 

■* fee MV? J k fce[m]\j 


where a jk = lim Aj . ;Afc ^. 0O y/X j X k /(X j - X k ) exists. So ||R(A a /Aj) 2 1| = 0(1). We claim 


K|| = Op(n 1//2 ), so the right hand side of (B. 6 ) is Op(l). To prove the rate of ||K||, note 


first by Lemma D.l, \\(p — m) 1 Z B A B Z' B — cl|| = Op(y/n/p), so we have 

1 ZpApZ' s 


| D || = 

l K ll = 

< 


n A j 


-o-i’-Zr )-=-(»-*), 


— Z^Za — I'm + Z^DZ^ 

n \. n 


— Z \Za — I m 


n 



A,- 


1 . 

+ 

3 

1 D 

-Z^Za 


^3 


n 


B.l 


= Op(n 2 ). 

(Ml = l + 0 P {l/n + l/p), 


Therefore, H^a/II^aII — e jvi|| = Op(n^ 1 / 2 ). From Lemma 
so (i) holds. 

(ii) Now we prove the second conclusion of the theorem on the non-spiked part 
Again we consider the case m — 1 and m > 1 separately. When m — 1, by definition of 
eigenvector, we can easily see that 


y -||zi|| 2 ) 6 a = —; 

Ai n / n\J Ai 


— X’b*i£ia + - X ' B X B ^ = \i%ib ■ 
n n 


Plug the former equation into the latter one, we have 

Xi£i b = A 9-X-p z i z i-X-B^ip d , 


rr 


n 


where A = A 1 /A 1 - ||zi|| 2 /b. Therefore, let f2' = ( 7 ^ ... , 7 p ), ||n£ ls || max = max fc < p |7 UibI 
is bounded by 


max 


k Ai 


A’ 1 ! 1 , Y , 

- 7 fcX B Z! 

n 


— z iXp 

n 


I^ibII + 


ly^x* 

n 


.IBM j 


which is Op(^/l/(n 2 p)) = O p(\Jlog p / {np)) due to the facts that A = Op(n 1//2 ), Ai = 
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Op(p), ||^ib|| = Op(n 1 / 2 ) according to Lemma B.l and three claims yet to be shown: 


\h'k X - , B Z A/n\\ = 0 P (y/lJn), ||7fcX' B XB/n|| = 0 P (y/pJn), 
\\Z' A X B /n\\ =0 P (y/pJn). 


(B.7) 


Now we turn to the case m > 1. Similar as derivations above, by definition, we have 


A,I m - --Afz' A Z A A‘i‘) SjA = -Arz'A^; 

^X' B Z A A l f$ IA + ix'jX^ = \jk jB . 

The former equation implies 

" “ 2' A x B i iB - A$ jA , 

where A = (Xj/Xj — l)I m — (A j 4 /A J ) 1//2 (n^ 1 Z^ 4 Z J 4 — Xj) 1 ^ 2 . Note that this definition 

of A degenerates to A 1 /A 1 — ||zi|| 2 /n when m = 1 and ||A|| = Op(n _1 / 2 ). Left multiply this 
equation by R defined in (i) and plug it into the previous equation, we obtain 


1/2 


, V 2 r 7' 


n 2 X 


^X^A^f RA^Z^X^ + -«X b ^ b 


n 


+ ^jA. e M X 'B^Te jA - ^' t X' B Z A \T^iA- 


l/2-i 


Carefully bounding each term of the right hand side by (B.7) and Lemma B.l, we find 


the dominating term is the third term, which has rate O p (^/p/n). Thus ||0£ 
max fc < p hUisI = °p(W(™P)) = 0 P (^/logp/(np)). 


IB Umax 


It remains to prove (B.7). Firstly, HZ^XB/nH 2 < \\Z' A ZA/n\\ ||Z/ri|| = Op(p/n). 


Thus the third result holds. To show the other two rates, denote v = Xp7 fc . Then each 
element of v is iid sub-Gaussian with bounded variance proxy. Hence v'v/n = Op(l) and 
v'zj/n = Op(n -1 / 2 ) for j < m. So we have 


I^X^ZV'nH A \/mma,x\Vzj/n\ = 0 P (n 1//2 ), 

j<m 


and 

ll7A,'XpXp/n|| 2 < HZpApZ^/nH |vV/n| = 0 P (p/n ). 
Now the proof is complete. 


□ 
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Lemma B.l. For j <m, ||£ y - A || = 1 + 0 P (n 1 + p x ) and ||£ jS || — 0 P (n : / 2 +p 1//2 ). 
Proof. Recall that X = (Z^A^ 2 , Z^A^ 2 ). Let Z = (Za,Zb), then 

z = xap = ^k^ 1 ,...,U'a;K 


where A p = diag(A J 4 ,A s ) and A n = diag(Ai,..., A n ). Further dehne A p = 
diag(l,..., 1, A m+ i,..., A p ) and consider the eigenvalue of the matrix n~ l ZA p Z'. The j th 
diagonal element of the matrix must lie in between its minimum and maximum eigenvalues. 
That is 


A„(izA„Z')< (izA,Z') = A,]T| 

k= 1 


1^<mIza p z’), 


where fjk is the fc-th element of the j th empirical eigenvector for j < m. Note that Xj/Xj 


r the left and right hand side converge in probability 
thus to cpj(n\j). So YIk=\ £]kMJX k = 0 P (l/n). 


D.l 


converges to 1, and decided by A j bot 
to (m + E ?= m +1 A j)/(nXj) by Lemma 

Also, by definition, A^/A*, = 0(1/p) for k < m while the ratio is 1 for k > m. Hence, 
£fc=m+i$k = Op(l/n + l/p), which implies that \\£ jA \\ = 1 - ELm+i £% = 1 + 0 P (n- 1 + 

_1 ) and ||£ s || = 0 P (n~ l/2 +p~ l/2 ). □ 


P 


C Proofs in Section 4 


Let us introduce some additional notations and two lemmas in order to prove Theorem 
4.1 Assume n is even, otherwise we can drop one sample without affecting the asymptotics. 
Let n = n/2. For any permutation a of {1, 2,..., n}, let (R, i 2 , ■ ■ ., i n ) ■— cr(l, 2,..., n). For 
r = 1,..., n, we dehne and K ff : 


w a r = Ap g r /(g/A p g r ) 2 so that fc(X l2r _ 15 X i2r ) = wfwf , 


(C.l) 


1 

K„ — — ^ w^w^' so that K = 


n 


card(5 n ) 


X K - 


(C.2) 


r =1 v "" a&S n 

where S n is the permutation group of {1,2,..., n}. For each hxed permutation a, we have 
the following two conclusions on empirical eigenvalues and eigenvectors of similar to the 
sample covariance for sub-Gaussian factor models. 


Lemma C.l. Under Assumptions \2.1 and f.l. for j < m, we have 

iy-»,l = Op(ii- 1/2 ), 


(C.3) 
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and for j > m, AJ = Op{n x ) where {AJ }’s are eigenvalues of K CT . 

Now consider the leading empirical eigenvectors ^ ■ of K CT , j < m. Each £■ is divided 
into two parts £ ■ = ((£j A y, (^,b) / ) / where £ ?A is of length m. 


Lemma C.2. Under Assumptions \ 2. l\ and f.l, for j < m, we have 


IICa-^II = 0 P {n 1/2 ). 

In addition, ||£J S || = O p (n ~ 1//2 + p~ 1 / 2 ). 

With the above two lemmas, we prove Theorem EH 
Proof of Theorem 14. 11 (i) First, we have the simple fact that 


IK - K|| = 


card(5„ 


O-gcSn 


K„ — K 


< 


card(iS„ 


E 

treSn 


K„ — K 


Now let us derive the rate of ||K CT — K||. Write 


(C.4) 


K ff — K = (r 1 .r 2 )diag(© A , 0 B )(f r, r 2 )' - diag( 0 A , ©p), 


where © A = diag (9i,...,9 m ) and ©p = diag {Q m+1 ,... ,6 P ). From Lemma C.l. we have 


||0 A — © A || = Op(n 1 / 2 ) and ||©p|| = Op[n x ). From Lemma 
Op(n ~ 1 / 2 ). Therefore, the following two bounds hold: 


C.2 


we have ||Li — (I m , 0 )'|| = 


ri© A i\ — diag(© A ,o)|| <||f i(@a — ©vi)r x || 

+ ||(fr - (i m ,o)')© A (r 1 - (I m ,0)')'|| 

+ ||(i m ,o) , 0 A (f 1 » (I m , 0)711 
+ ||(fi - (I m ,0) , )© A (I m ,0)|| , 


which is Op{n l ' 2 )] in addition, 

||r 2 ©Bf 2 — diag(0, ©p)|| < ||r 2 || 2 ||©p|| + ||©s|| = Op(l/n + 1/p ), 

where ||@p|| = 0(1/p). Therefore, ||K CT — K|| = Op(t 7 , -1 / 2 ) for any fixed permutation cr, 
which implies that ||K — K|| = Op(n~ Z 2 ). This further implies conclusion (i) by Weyl’s 
inequality and ||£ ;S || = Op (nr 1 / 2 ). 

(ii) We first prove the following conclusion: there exists diagonal scaling random matrix 
Dj and random vector h such that ||Dj£- s — h|| = Op (^/log n / (np )) where h is uniformly 
distributed over the centered sphere of dimension p — m and radius r = Opfn^ 1 ^ 2 ). 
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To this end, we need to employ rescaled data xf = diag(I m , f2 0 ) x i where f2 0 = 
diag(y / c/A m+ i,..., y/c/Xp ). Here superscript R denotes rescaled data by diag(I m , n 0 ). Re¬ 
call that Xj follows EC(0, A p , (). After rescaling, xf follows EC(0, diag(A J 4 , cl p _ m ), £). Let 
At = n{n — l)/2 and define a N xp matrix X such that A^yy.) = (xj — Xj/)'/||xj — Xj/||, which 
corresponds to each pair of samples. Clearly K = N 1 X'X. Let X = (Xa,Xb), and corre¬ 
spondingly X R = £Adiag(I m , f2 0 ) = {EX A , £X b E1 0 ) so that X R = CX A and X B = £AgO 0 , 
where 

C = diag(||x.j - x,||/||xf - xf ||) . 


iR 


Other quantities are also defined for the rescaled data. For example, £ ■ and uf are eigen¬ 
vectors of K A> and K R = N~ 1 X R X R> . Let ^ = (^ai^b)'- 

Since the estimator K. R is invariant to orthogonal transformation of the data, similar to 
(2007), we can show ^ jS /||^ B || is distributed uniformly over the unit sphere. Define 

t 

h = £j B . From the proof of (i), we know ||h|| = Op(n -1 / 2 ). So h is uniformly distributed 
over a centered ball of radius Op(n ~ 1 / 2 ). Hence it only remains to bound the difference of 
and h to validate the claim. 


Paul 


sjB 


Note that 


K - K*|| < || N~\X a X' a - CX a X' a C)\\ + || N~\X b X b - CX B EtlX' B C)\\ . 


It is not hard to show ||£ — Ijv|| = O p {yJ\ogn/p), which is in the same order as the first 
term. The second term is dominated by Op(||£ — I/v||) plus 


II N-'Xsiip-m - tot)x' B \\ = ll^ _ (I - n'fcx'Bx B {i - n 


§)*l 


1 II 1 

V n> cr£S n r= 1 


1 /2 x 

where wfp = Af g r B/(g/A p g r ) 2 . Using the notations defined in the proof of Lemma 
we have 


C.l 


r =1 


-£(i-nUw"flW r y(i-nU = ^lRbAb /2 (i - r' b l 


where (Rp)*. = gw/y/P and 

L = diag((p _1 g 1 / Apgi) _ 5,..., ( p ~ l g Tl 'A p g n ] 


The right hand side is of order Op(n 1 yjn/p) = Op(yJl/(np )) by Lemma D.l. This implies 
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|K — K 7l> || = Op(\J\ogn/p). Thus by the sin 9 theorem of 


Davis and Kalian 


(1970), we get 


|u j — u^|| = Op(y/\ogn/p). With (B.l), we have 


VAfO ( „£, /? - h|| = 


QqX'bUj X b uf 


N\? 


N if 


< 11^01 




NXf 


U.; 




R | 


The right hand side is Oplogn/{np)) since from above ||uj — u^|| = Op(^/logn/p), 
||£ — I|| = O P (sJ\ogn/p), Xf converges to 9j and ||d^/-N/]V|| = 0 P (n -1 / 2 ), which is true 
because 


N-'X’pXp 


< 


EllsE 


card (‘ s “' " n ,.i 


W rB W rB 


= 0 P (l/n ), 


where the last equality can be seen from the proof of Lemma C.l Hence we conclude, if 


D, = J XMSh, 


Ojl B - h|| = Op(yiogn/(np)). 


We are done proving the claim. 

Now let us come back to our goal of bounding ||f2^j S || m ax for any p by p — m matrix f2 
such that fi'fi = Ip-m- Obviously, 


| 0 £ 


jB I 


< ||OD ■ — h-) 11 max + ||f2D • 1 h|| max — / + //. 


Let £1' = ( 7 1 ,..., 7 p ). Then, 

/ < maxIlDT^IIHD^ s - h|| < 0 P (1)||D ^ jB - h|| = 0 P (y/\ogn/(np )), 


k<p 


where the second inequality is due to A*, < C 0 for k > m, || 7 fc || < 1 and Xj/Xf = 

I + o P (l). Thus to bound the elementwise sup-norm ||f2£ ; - p ||max, it suffices to show 

II = 0 P (A/log p/ (np )). 

Let G = (Gi,..., G p _ m ) be standard normal distributed. Obviously, h = ||h|| • G/||G|| 
where G/ 1| G || is uniform over unit sphere of dimension p—m. Provided ||h|| = 0 P (n O 2 ), we 
only need to show ||f2Dj 1 G|| max /||G|| = Op(y/\ogp/p). It follows from Xj/Xf = 1 + o P (l) 
and 


max |7l.O 0 X G|/1|G|| = 0 P (^\ogp/p ), 

k<p 


since 7'f2 0 G is normally distributed with bounded variance. This completes the proof for 


ii . 


□ 
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Proof of Lemma 1C. 11 Recall that 


1 ^ ^ 

\< = Ap 2 g r /(g r / A p g r )5 and K a = - V" . 

n 

r =1 

For ease of notation, let us assume a is the identity permutation and ignore the index a 
in the following. Define W = (wf,... ,w?)' = (wi,...,w p ) = (Z^A 1 / 2 , Z b A]/ 2 ) where 
Z A = (zi,... ,z m ), Z B = (z m+1 ,... ,z p ) with 

Z j L(5 , i j, . . . , Qrj) /yfP 


and 

L = diag((p' 1 gi , Apg 1 )“^,..., (p _ 1 gfi'Apg fi )“*). 

Then, K ff = hWW'W. Exchanging W' and W, we further define K CT = n _ 1 WW / , which 
share the same nonzero eigenvalues as K ff . Now in order to prove the lemma, let us decom¬ 
pose K ff = A + B where 


m p 

A = h -1 XjZjZj, and B = h -1 AjZjz'-. 

3 =1 j=m+l 


We deal with A Erst. Note that A has the same eigenvalues as matrix A = 
u-'A^Z'aZaA] 2 , where Z a is an n x m matrix with iid rows. Therefore, we are in the low 
dimensional situation with fixed dimension m. It is easy to see that 


(A) 




-i n 

1 yr^g, 


i fjr j \J A i Xj 


n 


ELi X k9 


2 rk 


1 


and thus by the central limit theorem, the j th diagonal element of A is 9j + Op(n~ 1,/2 ) and the 
off-diagonal elements are of order Op(n^ 1 / 2 ). Therefore, write K = cliag(©n, ©b), 
we have A = ©a + H with ||H|| = Op{n~ 1 ^ 2 ). By Weyl’s inequality, 


|Aj(A) - Qj\ = |Aj(A) - Aj( 0 a )| < ||H|| = 0 P (rC 1 / 2 ). 


By the definition of B, B = n 1 ZbAbZ' b , where the i th row of Zb is (Zb);. = 
gis/HApAgj||. Let Zb = LRb where (Rb);. = g w/y/P- Therefore 

h-Afc(B) = A^ZbAbZ^) < Afc(RBABRn)A max (L"). 
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Since each row of Rp is Gaussian, by Lemma D.l with t = yn, for any k < n, 

1 


Aa/RbApR^) — 


P — 771 

^ j=m +1 


E x ’ + 0 p Up)- e+ ° p {Jp )+ o(1) - 


In addition, we have A max (L 2 ) = 0p( 1). This is because 

\ p -l i v -il 

A max (L 2 ) = (min - V < ( min - V Xjf/f) = - + 0 P (^\ogn/p). 

J=1 j=m+l 

Therefore, Afc(B) = Op(n _1 ). By Wely’s Theorem, Aj(A) + A n (B) < XJ < Aj(A) + Ai(B). 
Therefore, we conclude that |AJ — 0j\ = Op(n _1//2 ) for j < m and AJ = Op(n^ 1 ) for 


J > 771. 


□ 


Proof of Lemma 1C. 21 Similar to Lemma B. 1 , we can prove ||£ Jj4 || = 1 + Op(l/n + 1/p) 

as well as ||£- s || = Op(n _1//2 + p -1 / 2 ). If m = 1, obviously the conclusion holds. So in the 

following we assume m > 1 . 

11 ~ 

Write W = (Z^A^, ZpAjj). Dehne u,, as the eigenvector of = n _ 1 WW / . We obtain 

1 u —ZyiA^Z^juj Diij A 11 ,, 

where we denote D = n _ 1 ZpApZ' s , A = Xj — 8j. W e then left multiply the above equation 
by A 1 a 2 Z' a / J nXj and employ the relationship (B.2) to replace in,- by £ A and £ - B to obtain 


Ojln - 0A )i- A =i UXXZaA!/ 2 - & A )i’ A + W^DZ'^A 


n 


uXa 


aTz'JIZbA] i 2 


(C.5) 


nAi 




Further, we dehne 


* = E 


fce[m]\j 


- Ok 


~ e kA e kA 


Then we have R(I m — A A /Xj) = I m — Left multiplying R to (C.5), 

£jA ~ (£jAi e jA) G jA ~ RH^ a + R 


A 1 / 2 7 ,/ T)7 A 1 / 2 
A a L a UL b J^ b -a 


tiXa 


where H = n 1 A 1 a 2 Z’ a Z a A l a z — + (nAj) 1 A A /z Z( 4 DZ y 4 A A /z . Dividing both sides by 


1/2 


-1 A 1 / 2r Z / 


1/2 


37 















||IJaII> we S et 


where 


= (<A\> e iA> - 1 ) e iA + RH (4^ - e iA) 
IIsjaII IIsjaII 


£ja/II£jaII e jA — R-He^A + r n , 


) e jA + RHI — e jA 


(C. 6 ) 


R 


ATz r A -DZ B Af I] B 


(C.7) 


nX-i 


AR 


sj'Al 


sjA 


■'jA 


*jA I 


Following Fan and Wang (2015), we are able to show ||r n || = op{n 2). In addition, from 
the proof of Lemma C.l, we have ||D|| = ||n l Z B A B Z' B \\ = Op{n x ) = op(n 2). So 


HI < 


n-'Afz' A Z A Af - ©4 


ID 


- a][ 2 z' a z A a}J^ 


n 


= Op{n 2). 


This gives H^a/II^/aII — e jA|| = Op{n 1//2 ) since clearly ||R|| = 0(1). Together with \\$,j A \\ = 
1 + Op(l/n + 1/p), it follows ||^ /j4 — 1| = Op (nr 1 / 2 ). The proof is complete. □. 


D A Technical Lemma 


Recall that z, = A p 1//2 Xj is the standardized version of the transformed data x*. We have 
the following theorem for z,, which will be useful for the proofs in Section [3] and [4j 

Lemma D.l. Let Z be the n x p matrix (p > n) with rows z'. Assume z, to be iid sub- 
Gaussian random vector with ||z ?: || 0 2 = sup ug5P -i |(zj,u)|^ 2 < M for some constant M > 0. 
Condition (3.1) holds for z*. The columns of Z are denoted by z ) of length n. Then for 


Vt > 0, let 5 = Co+ -^=, we have 


p 

< max{<5 2 , S} , (D.l) 

J i=l 

with probability at least 1 — 2exp(—cot 2 ), where Co,co > 0 depend on M,Mi,M 2 . Here, \wj\ 
is bounded from above for all j and w = p _1 Xp=i W P 

Proof. Without loss of generality, let us assume all the w/s are non-negative and bounded 
away from zero. Otherwise, we subtract the minimal one from all the w/s. Let the new 
non-negative weights to be Wj = Wj — w m i n + 1. Since all the n eigenvalues concentrate to 
the same number, it is easy to separately consider the concentration for \i(p~ l Y^=i WjZ,,z'-) 
and Aj(p -1 zjz)), which both have nonnegative lower bounded weights. 


max 

i<n 




WjZj z j 


— w 
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Let D = diag(wi,..., w p ), so p 1 Y^=\ = P 1 ZDZ / . Assume without loss of 

generality that Wj is decreasing and w p = 1. First we have 


Ar 


-ZDZ 7 ) = A^f-D^Z'ZD 1 / 2 ) < A max (D)A max (-Z'Z 


P 


P 


P 


Since z, : is sub-Gaussian, by Theorem 5.39 of Vershynin (2010), we have with probability at 
least 1 — 2exp(—cit 2 ), 


A r 


^z'z) < (i + Ci\/- + ~^=) 2 ■ 


n 

If t > Ci^/vJip, without loss of generality we assume C\ > 1. Then since |m| < w\ < 5 2 , 
the minimum eigenvalue of p _1 ZDZ' satisfies the conclusion. It remains to validate the 
conclusion for the maximal eigenvalue. According to the above, A max (D)A max (Z / Z/p) is 
bounded by 


Wl-(l + CiJ- + < ~(y/wi + + — + 

n \ \l n yJnJ p \ V n 


n 


(V& + + (1+ ^ I)t ) 2 < w + max{5 2 , S} . 


Thus the theorem holds for t > C\^Jw\p. 

We only need to consider the case t < Cx^Jwpp. This corresponds to conditioning on 
the event S = {A max (ZDZ'/p) < Cf} where C 2 > y/w{{l + Ci + Ciydhi). Obviously, with 
probability at least 1 — 2exp(—C]t 2 ), the event S holds. To prove (D.l), it suffices to show 
that with high probability 


< 5 , 


-ZDZ'-ml 

< 2 max 

-IlD^Zxll 2 -w 

P 

xe.'V 

P 


where A f is the I-net covering the unit sphere S n 1 and \J\f\ < 9 n (Vershynin, 2010). Due 
to the following decomposition, 


■ I_ ’2 ■* 

iD^Z'xll 2 = y^ XjY)2 Zi = ^ Wj + ^ x 2 (z.Pzt - tr(D)) + XjXkZ^Dzk , 

i j= 1 i=1 jjLk 


we have 
1 


— IID 2 Z'xll 2 — w 
P 


< 


- y^x 2 (z-Pz i - tr(D)) + - XjXkZ^Bzk = : |A x | + |A 2 | 

P i =1 P 
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Therefore, 


pf 

-ZDZ' - wl 

< pf max 

V 

p 

) ~ V xeA/" 


-||D5Z'x|| 2 

p 


w 


> 5/2 


< P^ max |Ai| > <5/4^ + P^ max IA 2 I > <5/4^ . 


(D.2) 


We need to separately bound the two terms on the right hand side. 
Since |Ai| < maxj< n |z'Dz* — tr(D)|/p, 


pf max IAJ > <5/4^) < LA/" I • n ■ max pf Iz'Dzj — tr(D)| > — V (D.3) 

V xeA f / x£j\f,i<n V 4 / 


For a fixed x and i, we now bound P(|z'Dzj — tr(D)| > 5p/ 4). By Lemma D.2, choosing 


A = D 1 / 2 / y/p, since z* satisfies (13. lb , we have 

P(jz'Dz i - tr(D)|/p > C^w i(y/~ + “)) - 3exp(—w), 


where C^ is dehned in Lemma D.2 and is bounded since i/i = W\ is bounded. Choose 
u = C 3 5 2 p so that u/p < 1 and C 3 < l/(64Cy,Wi), which implies 5/4 > 8C^Wi^/u/p > 
C^wi(^/u/p + u/p). So P(|z'Dzj — tr(D)| > 5p/ 4) is bounded by 3exp (—C 3 5 2 p). Therefore, 


from (D.3), we have 

p( max I Ad > 5/4^ < 9 n • n ■ 3exp(-C 3 5 2 p) < exp (-c 0 f 2 ) 


by choosing Cq in the dehnition of 5 large enough. This proves the first term in (D.2). 


For the second term in (|D.2|), we apply the decoupling technique. By Lemma 5.60 of 

4 


Vershynin ( 2010 ), 


A 2 | < - max 

p TC[n] 


.r,.r/,z' I)Z/ 

jeT,fceT c 


So we have 


P^max|A 2 | > 5/A^j < |A/’||T[ • rnaxP^ XjX k z'jDz k 


j&T,keT c 


5p 

> 16 


(D.4) 


For each fixed x and T, we first consider the above probability conditioning on z k for k € T c . 
Let Hj = J2keT c x k z 'j^ z k = z'-DZycXyc where Z 7 -c is constructed by columns z k for k G T c 
and X 7 -C contains the coordinates of x corresponding to T c . We know Hj is sub-Gaussian 
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since 


mi*, < 11zj-11 0 2 HD 1 / 2 1| ||D ly/ 2 Z r c||||x T c|| < v^r||z i || 02 ||D 1 / 2 Z|| < C 2 M^p, 

where the last inequality is due to conditioning on the event E. So there exists a constant 
C 4 > 0 independent of Z-yc such that ||hfj||</, 2 < C 4v /p. Furthermore, the weighted sum of 
Hj 's is also sub-Gaussian distributed with || x jHj\\ ( j >2 — 


Hence, from (D.4), 


max |A 2 | >5/4J <9 n -2 n -E^P[j y. x i H i 

V ier 


5p 

> 16 


j r c 


where the right hand side is bounded by 


9 n ■ 2 n ■ 2e - c ‘ 5 V/(256Ci P ) < exp(—c 0 f 2 ) 


by choosing a large enough Cq in the definition of S. So we bounded the second term. 


To conclude, from (D.2), 


P( jj-ZDZ' — wl|| > 5) < 2exp(—c 0 f 2 ), 


which implies (D.l). 


.□ 


Lemma D.2. Let A be a m by n matrix and £ := A'A. Suppose x = (aq,... ,x n ) is an 
isotropic sub-Gaussian random vector, that is, 


E[exp(a'x)] < exp(|| ck|| 2 /2), 


for all ct G M n . For all t > 0, 


pfllAx|| 2 > tr(E) + 2^/tr(E 2 )t + 2||£||f) < e“‘; 


(D.5) 


if furthermore the SVD decomposition of A = UDV' where D is a m by m diagonal matrix 


and U, V consist of left and right orthogonal singular vectors and (3.1) holds for V'x, we 
have, 

P(j|Ax|| 2 < tr(£) - CV^/tr(S 2 )t - €V||£||i) < 2e~ t , (D.6) 

where C ^ = ma,x{2if + 2if\l M 2 , 2 + 2 Mf 1 } and if = Ai(D 2 )/A m (D 2 ) is the condition number 

of 5L 
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The above lemma is an extension of the exponential inequality for iid one dimensional 


sub-Gaussian variables proved by Laurent and Massart (2000). ft is the Hanson-Wright 


inequality for the quadratic functional of a sub-Gaussian random vector. Rudelson and 


Vershynin (2013) showed this inequality for independent sub-Gaussian elements. Hsu et al. 


(2012) obtained the upper tail bound (D.5) under a much weaker assumption of general sub- 


Gaussian vector with dependency. However, they did not provide result for the lower tail 
bound. Note that quadratic functionals are different from linear functionals in that changing 


the sign of x does not naturally give the lower tail bound. In the following, we prove (D.6) 


under (3.1). This bound is used for proving Lamina D.l 


Proof. Denote y = V'x G M m so that || Ax|| 2 = y , D 2 y. Write D 2 = diag(di,..., d m ) with 


decreasing diagonal elements. Since (3.1) holds for y, we have for 9 < Mi, 


P 


^y'y < m — 2\/mM 2 t — 2 M l < exp ^ — 2 9\]mM 2 t — 29M 1 1 t + M 2 9 2, m 


Choose 9 = \Jtf ( M 2 m ) which is smaller than M\ if t < mMfM 2 while choose 9 = Mi if 
t > mMfM 2 . In any case, we can show that the right hand side is bounded by exp(— t). 
Define an event £ = {y'y > m — 2 y/mM 2 t — 2 Then P(£ c ) < exp (—t). Futhermore, 
we define 

,4={ y'D 2 y<]T 

t 'j 


'j<m 


dj C ■ih \ t 




So ( D.6| ) is equivalent to P(^4) < 2exp(— t). Obviously P(^lp|£’ c ) < exp(— t). We bound 
P(pfn^) as follows: 


^ P (yWm - D 2 )y > tr(d 1 I m - D 2 ) 

+C*lfr„ ~d 2 j + C^dit — 2d\ y/ mM- 2 t — 2diMf 1 t j 
< pfy'(dilm - D 2 )y > tr{dil m - D 2 ) 


+2\/t^ < (di - dj) 2 + 2 (di - d m )t ^ < exp(— t ), 


where the first inequality is a summation of the inequalities defined in events A and £] 
the second inequality is due to the fact d 2 > d\ \frri > ~2j< m (di — dj) 2 where 

if = di/d m and the last inequality is by (D.5). Thus we have proved P(yf) < 2exp(— t). □ 
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